Intro to RNA-seq Analysis

Hosted by the IIGB Bioinformatics Core

Brandon Le

Genomics Building 1207G

October 18-19, 2023

About the Core

IIGB Bioinformatics Core

  • Data analysis
    • RNA-seq, ChIP-seq
    • scRNA-seq
    • DNA-seq
    • Other
  • Project consultations
  • Method development
  • Daily (2-3PM)
  • In-person or via Zoom (by appt)
  • Calendly

Workshop Outline

Day One

  • Basic introduction to RNA-seq
  • Data analysis setup
  • Data QC
  • Sequence alignment

Day Two

  • Data pre-processing
  • Differential expression analysis

Day One

What is RNA-seq?

RNA sequencing or RNA-seq is one of many methods used for gene expression studies by obtaining a snapshot of the RNA molecules within a biological system.

Van den Berge et. al (2019) Annu Rev Biomed Data Sci

What is RNA-seq?

Next-generation sequencing (NGS) technologies

  • Short-read based (e.g. Illumina)
    • 35 - 300 bases
  • Long-read based (e.g., PacBio, Oxford Nanopore)
    • Several kilobases

Sequencing resolution

  • Bulk tissue (bulk RNA-seq)
  • Laser-capture microdissection (LCM + RNA-seq)
  • Single-cell (scRNA-seq)
  • Spatial transcriptomics (spatial + scRNA-seq)

Logging in to the server


You should receive an email with login information if you’ve never had an account on the HPCC. For those with an account already, you can use the same credentials to log into the server.

Open up your terminal program (on your Mac or PC) and at the $ prompt, enter the following command (change ‘username’ to your account name)


ssh username@cluster.hpcc.ucr.edu

Setting up the analysis directory

We will clone the GitHub repository for this workshop. The repo is stored at:

GitHub Repo


Cloning the repository:

git clone https://github.com/bioinformatics-workshop/RNA-Seq-Workshop.git

Update changes made to the repository:

git pull


The directory structure should look like this:

├── analysis
   ├── deseq2
   ├── fastqc
   ├── featurecounts
   ├── multiqc
   ├── star
   └── trim_galore
├── code
├── genome
├── index
├── log
└── raw

Create conda environment

We will create a conda environment containing bioinformatics software packages not readily available on the cluster for data processing and analysis. This conda environment includes special R packages (e.g., EnhancedVolcano) and software (e.g.,multiqc).


To create the conda environment:

conda create --name DEG-analysis --file code/conda_spec_file.txt

To use the conda environment:

conda activate DEG-analysis

Analysis Toolkit

Quality Control

Sequence Alignment

Quantification

Differential expression

Data Visualization

R and R packages

Sample Dataset Demo


BioProject: PRJNA950346

GEO Series: GSE228555

Title: Transcriptome expression of WT and mir163 mutant

Organism: Arabidopsis thaliana

Ecotype: Col-0

Genotype(s): WT, miR163 mutant

Biological replicates: 3

Tissue: Seedlings

Library kit: NEBNext® Ultra™ RNA Library Prep Kit for Illumina (non-stranded)

Sequencing: Illumina paired-end 150bp

Workflow Outline

  • Create metadata
  • QC using Fastqc and trim_galore
  • Genome indexing for STAR aligner
  • Sequence alignment with STAR
  • Gene quantification using featureCounts
  • Differential expression analysis with DESeq2
  • Data visualization in IGV

Workflow - Create metadata

Generate a metadata.csv file containing information about the dataset. The metadata will be used for processing and differential expression analysis.

The metadata for this dataset are:

srr_id,ecotype,genotype,treatment,tissue,gsm_id,biorep,samplename

srr_id
ecotype
genotype
treatment
tissue
biorep
samplename
fq1
fq2
SRR ID from the SRA run
Col-0
WT, miR163_mut
7-day-old plants
seedlings
1,2,3
wt_1,wt_2,wt_3,mir163_1,mir163_2, mir163_3
raw/SRRxxxx_1.fastq.gz
raw/SRRxxxx_2.fastq.gz

Note

An extensive metadata file describes the samples in detail. However, the minimum metadata file should contain the following:

  • samplename (sometime it’s the same name as the fastq file),
  • condition/treatment/genotype
  • fq1/fq2

Other metadata info can include:

  • time point
  • technical replicate
  • sequencing batch
  • library kits


srr_id,ecotype,genotype,treatment,tissue,biorep,samplename,fq1,fq2
SRR24016000,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,1,mir163_1,raw/SRR24016000_1.fastq.gz,raw/SRR24016000_2.fastq.gz
SRR24016001,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,2,mir163_2,raw/SRR24016001_1.fastq.gz,raw/SRR24016001_2.fastq.gz
SRR24016002,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,3,mir163_3,raw/SRR24016002_1.fastq.gz,raw/SRR24016002_2.fastq.gz
SRR24016003,Col-0,wt,7-day-old seedlings without treatment,seedlings,1,wt_1,raw/SRR24016003_1.fastq.gz,raw/SRR24016003_2.fastq.gz
SRR24016004,Col-0,wt,7-day-old seedlings without treatment,seedlings,2,wt_2,raw/SRR24016004_1.fastq.gz,raw/SRR24016004_2.fastq.gz
SRR24016005,Col-0,wt,7-day-old seedlings without treatment,seedlings,3,wt_3,raw/SRR24016005_1.fastq.gz,raw/SRR24016005_2.fastq.gz


The metadata_sub1M.csv contains info referencing the sub-sampled dataset

srr_id,ecotype,genotype,treatment,tissue,biorep,samplename,fq1,fq2
SRR24016000_sub1M,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,1,mir163_1,raw/SRR24016000_sub1M_1.fastq.gz,raw/SRR24016000_sub1M_2.fastq.gz
SRR24016001_sub1M,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,2,mir163_2,raw/SRR24016001_sub1M_1.fastq.gz,raw/SRR24016001_sub1M_2.fastq.gz
SRR24016002_sub1M,Col-0,miR163_mut,7-day-old seedlings without treatment,seedlings,3,mir163_3,raw/SRR24016002_sub1M_1.fastq.gz,raw/SRR24016002_sub1M_2.fastq.gz
SRR24016003_sub1M,Col-0,wt,7-day-old seedlings without treatment,seedlings,1,wt_1,raw/SRR24016003_sub1M_1.fastq.gz,raw/SRR24016003_sub1M_2.fastq.gz
SRR24016004_sub1M,Col-0,wt,7-day-old seedlings without treatment,seedlings,2,wt_2,raw/SRR24016004_sub1M_1.fastq.gz,raw/SRR24016004_sub1M_2.fastq.gz
SRR24016005_sub1M,Col-0,wt,7-day-old seedlings without treatment,seedlings,3,wt_3,raw/SRR24016005_sub1M_1.fastq.gz,raw/SRR24016005_sub1M_2.fastq.gz

Anatomy of a FASTQ file

Fastq files contain the raw sequence and the base quality score. Each sequence is comprised of four lines.


The sequence header (starts with @) and contain identifiers for the read
The sequence
'+'
Base quality scores 


@VH01192:45:AAC7JMMM5:1:1101:19973:1000 1:N:0:AGTTCAGG+TCTGTTGG
NATGGGACAGACATGCTGGCGGCACTCACTCACTTGGGCGGCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTTCAGGATCTCGTATTCCCTTTTTTTTTTGTAAATTTTTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
#-C;;CCCCC-C-CCCCCCCCC-CCCCCCCCCCCCCCCCCCC-CCCCCCCC-CC-CCCCCCCCC-CC-CCCCCCCCC;-C--CC-CCCC--CC-C--;---;C----;-;CC-------C-C-C-C--CCCC-;C;CCC-CCCCCCCCCCC
@VH01192:45:AAC7JMMM5:1:1101:20125:1000 1:N:0:AGTTCAGG+TCTGTTGG
NCCCAGCCCCAGCGACTCCTAATAAAGCATTTCAGCAAATAAAAAAAAAAAAAAAAAGATCGGAAGAGCACACGTCTGAACTCAAGTCACAGTTCAGGATGTGGTTTTTGGTTTTTTTTTTTTTAAATTTTGGGGGGGGGGGGGGGGGGGG
+
#CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC--CCCCCCCCCCCCCCCC;CCCC-CCCCCCCCCC---C-C---C-CC-CCCCC;CC-CC-C--;C-CC-------C-C---C-C----CC;C-CCCC;C;CCCCCCCCCCCC
@VH01192:45:AAC7JMMM5:1:1101:21034:1000 1:N:0:AGTTCAGG+TCTGTTGG
NTTGCAATGCTCAATAAGTCTATTCCACCTCAGTGTCCTTTTTAAAGAGTTTTGGAAAAAAAAAAAAAAAAAAAGATCGTAAGAGCACACGTCTGAACTCCAGTCACAGTTCAGGTTGTGGGTTTTCGTGTTTTTGTTTTTATTTTTGGGG
+
#CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC-CCCCC;C;CCCCCCCCCCCCCCCCCCCCCCCCCC-CC;C-CC-C;CCCCCC;C;CC-CCCCC;;;CCC-;;C;-;-C---C-C;;;--C-CCCC---C;C-----C;--C-
@VH01192:45:AAC7JMMM5:1:1101:21488:1000 1:N:0:AGTTCAGG+TCTGTTGG
NCCTCAAAAAAAAAAAAAAAAAAAAAAAAAATTTGGTATGTGAAATTTTTTTAATACATTTAAATTTTATGTTTTTGTTTTTCTTTTTTTTTTTTTTAAATATTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
+
#CCCCCCCCCCCCCCCCCCCCCCCCCCCCC-;---C-C;-;------CCCCCC;--CC--C-C-;-C-C--;C;-----C-C--CCCC-C--;C-C---C-;C;-C-C;CCCCCCCCC;CCCCC-CCCCCCCCCCC;C;CCCCCCCC;CCC

Anatomy of a FASTQ file header


@VH01192:45:AAC7JMMM5:1:1101:19973:1000 1:N:0:AGTTCAGG+TCTGTTGG

VH01192
45
AAC7JMMM5
1
1101
19973
1000
1
N
0
AGTTCAGG+TCTGTTGG
unique instrument id
run id
flowcell id
flowcell lane
tile number within the flowcell lane
x-coordinate of the cluster within the tile
y-coordinate of the cluster within the tile
Member of a read/mate pair, 1 or 2
Y if the read is filtered, N otherwise
0 when none of the control bits are on
Index sequence

Workflow - QC

We will run QC using Fastqc and trim_galore.

trim_galore will also run fastqc on the trimmed dataset.

while IFS=, read -r srr_id eco geno trt tiss biorep samplename fq1 fq2
do
  sbatch code/fastqc.sh ${srr_id}
  sbatch code/trim_galore.sh $samplename $fq1 $fq2
done < <(tail -n +2 raw/metadata.csv)
#!/bin/bash -l

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=04:00:00     # 4 hours
#SBATCH --job-name="fastqc"
#SBATCH --output=log/%x_%j.log
#SBATCH -p batch # This is the default partition, you can use any of the following; intel, batch, highmem, gpu

module load fastqc

# Load variables
SRR_ID=$1
IN_DIR=raw
OUT_DIR=analysis/fastqc

# Create analysis folder (if it doesn't exist)
[ ! -d "$OUT_DIR" ] && mkdir -p "$OUT_DIR"

# -t : threads
# --noextract : do not extract zip output files

fastqc -f fastq -t 4 --noextract -o $OUT_DIR $IN_DIR/${SRR_ID}*.fastq.gz

Good data example

Bad data example

$ ls analysis/fastqc
SRR24016000_1_fastqc.html
SRR24016000_1_fastqc.zip
SRR24016000_2_fastqc.html
SRR24016000_2_fastqc.zip
SRR24016001_1_fastqc.html
SRR24016001_1_fastqc.zip
SRR24016001_2_fastqc.html
SRR24016001_2_fastqc.zip
SRR24016002_1_fastqc.html
SRR24016002_1_fastqc.zip
SRR24016002_2_fastqc.html
SRR24016002_2_fastqc.zip
SRR24016003_1_fastqc.html
SRR24016003_1_fastqc.zip
SRR24016003_2_fastqc.html
SRR24016003_2_fastqc.zip
SRR24016004_1_fastqc.html
SRR24016004_1_fastqc.zip
SRR24016004_2_fastqc.html
SRR24016004_2_fastqc.zip
SRR24016005_1_fastqc.html
SRR24016005_1_fastqc.zip
SRR24016005_2_fastqc.html
SRR24016005_2_fastqc.zip
#!/bin/bash -l

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=06:00:00     # 6 hours
#SBATCH --job-name="trim_galore"
#SBATCH --output=log/%x_%j.log
#SBATCH -p batch # This is the default partition, you can use any of the following; intel, batch, highmem, gpu

################### Trim_galore.sh ###################

# Load conda environment
module load trim_galore

# Reading in variables
SAMPLENAME=$1
FQ1=$2
FQ2=$3
OUT_DIR=analysis/trim_galore

# Create analysis folder (if it doesn't exist)
[ ! -d "$OUT_DIR" ] && mkdir -p "$OUT_DIR"

# Running trim_galore
trim_galore --fastqc --gzip --trim-n -j 4 --paired -o $OUT_DIR --basename ${SAMPLENAME} ${FQ1} ${FQ2}

# Running trim_galore (single-end)
# trim_galore --fastqc --gzip --trim-n -j 4 -o $OUT_DIR --basename ${SAMPLENAME} ${FQ1}

The ‘val_1’ and ‘val_2’ files are the validated files after trim_galore processing. These files will be the input for the STAR alignment.

$ ls analysis/trim_galore
mir163_1_val_1_fastqc.html
mir163_1_val_1_fastqc.zip
mir163_1_val_1.fq.gz
mir163_1_val_2_fastqc.html
mir163_1_val_2_fastqc.zip
mir163_1_val_2.fq.gz
mir163_2_val_1_fastqc.html
mir163_2_val_1_fastqc.zip
mir163_2_val_1.fq.gz
mir163_2_val_2_fastqc.html
mir163_2_val_2_fastqc.zip
mir163_2_val_2.fq.gz
mir163_3_val_1_fastqc.html
mir163_3_val_1_fastqc.zip
mir163_3_val_1.fq.gz
mir163_3_val_2_fastqc.html
mir163_3_val_2_fastqc.zip
mir163_3_val_2.fq.gz
SRR24016000_1.fastq.gz_trimming_report.txt
SRR24016000_2.fastq.gz_trimming_report.txt
SRR24016001_1.fastq.gz_trimming_report.txt
SRR24016001_2.fastq.gz_trimming_report.txt
SRR24016002_1.fastq.gz_trimming_report.txt
SRR24016002_2.fastq.gz_trimming_report.txt
SRR24016003_1.fastq.gz_trimming_report.txt
SRR24016003_2.fastq.gz_trimming_report.txt
SRR24016004_1.fastq.gz_trimming_report.txt
SRR24016004_2.fastq.gz_trimming_report.txt
SRR24016005_1.fastq.gz_trimming_report.txt
SRR24016005_2.fastq.gz_trimming_report.txt
wt_1_val_1_fastqc.html
wt_1_val_1_fastqc.zip
wt_1_val_1.fq.gz
wt_1_val_2_fastqc.html
wt_1_val_2_fastqc.zip
wt_1_val_2.fq.gz
wt_2_val_1_fastqc.html
wt_2_val_1_fastqc.zip
wt_2_val_1.fq.gz
wt_2_val_2_fastqc.html
wt_2_val_2_fastqc.zip
wt_2_val_2.fq.gz
wt_3_val_1_fastqc.html
wt_3_val_1_fastqc.zip
wt_3_val_1.fq.gz
wt_3_val_2_fastqc.html
wt_3_val_2_fastqc.zip
wt_3_val_2.fq.gz

Workflow - STAR Indexing

To run the STAR aligner, we first need to create an index of the genome.

To generate an index, you’ll need the following files:

FASTA: Fasta file containing sequences of all the chromosomes/scaffolds for the genome

GTF/GFF: General Transfer Format (GTF) or General Feature Format (GFF)

Note

You only need to run the indexing step once. The indexed files can be used for alignments for all subsequent samples

Warning

For assembled genomes containing large number of scaffolds or large genomes, this process can take some time and may require adjusting the indexing parameters.

sbatch code/STAR-index.sh
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --cpus-per-task=32
#SBATCH --mem=100g
#SBATCH --time=1-23:00:00
#SBATCH --job-name="STAR-index"
#SBATCH --output=log/%x_%j.log

## Load environment
module load star

NAME=ara11
FASTA=genome/genome/TAIR10_Chr.all.fasta
GTF=genome/Araport11_GFF3_genes_transposons.201606.gtf
GFF=genome/genome/Araport11_GFF3_genes_transposons.201606.gff
IDX_DIR=index

PARAMS="--runThreadN 32
    --runMode genomeGenerate
        --genomeDir $IDX_DIR/$NAME
        --genomeFastaFiles $FASTA
        --sjdbGTFfile $GTF
        --sjdbOverhang 99
        --genomeSAindexNbases 12"

STAR $PARAMS
>Chr1 CHROMOSOME dumped from ADB: Jun/20/09 14:53; last updated: 2009-02-02
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCAT
GAATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTTCTCTGGTTGAAAATCATTGTGTATATAATGATAATTTT
ATCGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTTAAAAATATCATTTGAGGTCAATACAAATCCTATTTCT
TGTGGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAAGCTTTGCTACGATCTA
CATTTGGGAATGTGAGTCTCTTATTGTAACCTTAGGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTGTTTA
TGTTTGGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGGAAAATTATT
TAGTTGTAGGGATGAAGTCTTTCTTCGTTGTTGTTACGCTTGTCATCTCATCTCTCAATGATATGGGATGGTCCTTTAG
Chr1    Araport11       5UTR    3631    3759    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1    Araport11       exon    3631    3913    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1    Araport11       start_codon     3760    3762    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1    Araport11       CDS     3760    3913    .       +       0       transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1    Araport11       exon    3996    4276    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1    Araport11       CDS     3996    4276    .       +       2       transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1    Araport11       exon    4486    4605    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1    Araport11       CDS     4486    4605    .       +       0       transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1    Araport11       exon    4706    5095    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1    Araport11       CDS     4706    5095    .       +       0       transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1    Araport11       exon    5174    5326    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1    Araport11       CDS     5174    5326    .       +       0       transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1    Araport11       CDS     5439    5630    .       +       0       transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1    Araport11       exon    5439    5899    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1    Araport11       stop_codon      5628    5630    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1    Araport11       3UTR    5631    5899    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010"
Chr1    Araport11       exon    6788    7069    .       -       .       transcript_id "AT1G01020.2"; gene_id "AT1G01020"
Chr1    Araport11       3UTR    6788    7069    .       -       .       transcript_id "AT1G01020.2"; gene_id "AT1G01020"
Chr1    Araport11       3UTR    7157    7314    .       -       .       transcript_id "AT1G01020.2"; gene_id "AT1G01020"
##gff-version   3
Chr1    Araport11       gene    3631    5899    .       +       .       ID=AT1G01010;Name=AT1G01010;Note=NAC domain containing protein 1;symbol=NAC001;Alias=ANAC001,NAC domain containing protein 1;full_name=NAC domain containing protein 1;computational_description=NAC domain containing protein 1 (NAC001)%3B FUNCTIONS IN: sequence-specific DNA binding transcription factor activity%3B INVOLVED IN: multicellular organismal development%2C regulation of transcription%3B LOCATED IN: cellular_component unknown%3B EXPRESSED IN: 7 plant structures%3B EXPRESSED DURING: 4 anthesis%2C C globular stage%2C petal differentiation and expansion stage%3B CONTAINS InterPro DOMAIN/s: No apical meristem (NAM) protein (InterPro:IPR003441)%3B BEST Arabidopsis thaliana protein match is: NAC domain containing protein 69 (TAIR:AT4G01550.1)%3B Has 2503 Blast hits to 2496 proteins in 69 species: Archae - 0%3B Bacteria - 0%3B Metazoa - 0%3B Fungi - 0%3B Plants - 2502%3B Viruses - 0%3B Other Eukaryotes - 1 (source: NCBI BLink).;Dbxref=PMID:11118137,PMID:12820902,PMID:15029955,PMID:15010618,PMID:15108305,PMID:15173566,PMID:15282545,PMID:16504176,PMID:16547105,PMID:16524982,PMID:16720694,PMID:16552445,PMID:17339215,PMID:17448460,PMID:17447913,PMID:18650403,PMID:20736450,PMID:24377444,locus:2200935;locus_type=protein_coding
Chr1    Araport11       mRNA    3631    5899    .       +       .       ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Note=NAC domain containing protein 1;conf_class=2;symbol=NAC001;Alias=ANAC001,NAC domain containing protein 1;full_name=NAC domain containing protein 1;computational_description=NAC domain containing protein 1 (NAC001)%3B FUNCTIONS IN: sequence-specific DNA binding transcription factor activity%3B INVOLVED IN: multicellular organismal development%2C regulation of transcription%3B LOCATED IN: cellular_component unknown%3B EXPRESSED IN: 7 plant structures%3B EXPRESSED DURING: 4 anthesis%2C C globular stage%2C petal differentiation and expansion stage%3B CONTAINS InterPro DOMAIN/s: No apical meristem (NAM) protein (InterPro:IPR003441)%3B BEST Arabidopsis thaliana protein match is: NAC domain containing protein 69 (TAIR:AT4G01550.1)%3B Has 2503 Blast hits to 2496 proteins in 69 species: Archae - 0%3B Bacteria - 0%3B Metazoa - 0%3B Fungi - 0%3B Plants - 2502%3B Viruses - 0%3B Other Eukaryotes - 1 (source: NCBI BLink).;conf_rating=****;Dbxref=PMID:11118137,gene:2200934,UniProt:Q0WV96
Chr1    Araport11       five_prime_UTR  3631    3759    .       +       .       ID=AT1G01010:five_prime_UTR:1;Parent=AT1G01010.1;Name=NAC001:five_prime_UTR:1
Chr1    Araport11       exon    3631    3913    .       +       .       ID=AT1G01010:exon:1;Parent=AT1G01010.1;Name=NAC001:exon:1
Chr1    Araport11       CDS     3760    3913    .       +       0       ID=AT1G01010:CDS:1;Parent=AT1G01010.1;Name=NAC001:CDS:1
Chr1    Araport11       exon    3996    4276    .       +       .       ID=AT1G01010:exon:2;Parent=AT1G01010.1;Name=NAC001:exon:2
Chr1    Araport11       CDS     3996    4276    .       +       2       ID=AT1G01010:CDS:2;Parent=AT1G01010.1;Name=NAC001:CDS:2
Chr1    Araport11       exon    4486    4605    .       +       .       ID=AT1G01010:exon:3;Parent=AT1G01010.1;Name=NAC001:exon:3
Chr1    Araport11       CDS     4486    4605    .       +       0       ID=AT1G01010:CDS:3;Parent=AT1G01010.1;Name=NAC001:CDS:3
Chr1    Araport11       exon    4706    5095    .       +       .       ID=AT1G01010:exon:4;Parent=AT1G01010.1;Name=NAC001:exon:4
Chr1    Araport11       CDS     4706    5095    .       +       0       ID=AT1G01010:CDS:4;Parent=AT1G01010.1;Name=NAC001:CDS:4
Chr1    Araport11       exon    5174    5326    .       +       .       ID=AT1G01010:exon:5;Parent=AT1G01010.1;Name=NAC001:exon:5
Chr1    Araport11       CDS     5174    5326    .       +       0       ID=AT1G01010:CDS:5;Parent=AT1G01010.1;Name=NAC001:CDS:5
Chr1    Araport11       CDS     5439    5630    .       +       0       ID=AT1G01010:CDS:6;Parent=AT1G01010.1;Name=NAC001:CDS:6
Chr1    Araport11       exon    5439    5899    .       +       .       ID=AT1G01010:exon:6;Parent=AT1G01010.1;Name=NAC001:exon:6
Chr1    Araport11       three_prime_UTR 5631    5899    .       +       .       ID=AT1G01010:three_prime_UTR:1;Parent=AT1G01010.1;Name=NAC001:three_prime_UTR:1

Workflow - STAR Alignment

Now we can align the reads from each sample to the reference genome.

while IFS=, read -r srr_id eco geno trt tiss biorep samplename fq1 fq2
do
  sbatch code/STAR-align.sh $samplename
done < <(tail -n +2 raw/metadata.csv)
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --cpus-per-task=16
#SBATCH --mem=100g
#SBATCH --time=0-8:00:00
#SBATCH --mail-type=ALL
#SBATCH --job-name="STAR-align"
#SBATCH --output=log/%x_%j.log
##################################################################

## Load STAR environment
module load star

## Setting variables
SAMPLENAME=$1
IN_DIR=analysis/trim_galore
OUT_DIR=analysis/star
IDX_DIR=index

# Create analysis folder (if it doesn't exist)
[ ! -d "$OUT_DIR" ] && mkdir -p "$OUT_DIR"

## Setting STAR parameters

PARAMS=" --runThreadN 16
         --runMode alignReads
         --genomeDir ${IDX_DIR}
         --readFilesCommand zcat
         --outSAMtype BAM SortedByCoordinate
         --outFileNamePrefix ${OUT_DIR}/${SAMPLENAME}_
         --outFilterMismatchNmax 0
         --outFilterMultimapNmax 1
         --quantMode GeneCounts
         --outWigType wiggle"

## Running STAR
STAR $PARAMS --readFilesIn $IN_DIR/${SAMPLENAME}_val_1.fq.gz $IN_DIR/${SAMPLENAME}_val_2.fq.gz

## Indexing bam files
samtools index $OUT_DIR/${SAMPLENAME}_*bam
wt_1_Aligned.sortedByCoord.out.bam
wt_1_Aligned.sortedByCoord.out.bam.bai
wt_1_Log.final.out
wt_1_Log.out
wt_1_Log.progress.out
wt_1_ReadsPerGene.out.tab
wt_1_Signal.UniqueMultiple.str1.out.wig
wt_1_Signal.UniqueMultiple.str2.out.wig
wt_1_Signal.Unique.str1.out.wig
wt_1_Signal.Unique.str2.out.wig
wt_1_SJ.out.tab
                                 Started job on |       Oct 16 15:01:05
                             Started mapping on |       Oct 16 15:01:06
                                    Finished on |       Oct 16 15:14:11
       Mapping speed, Million of reads per hour |       94.20

                          Number of input reads |       20540486
                      Average input read length |       297
                                    UNIQUE READS:
                   Uniquely mapped reads number |       19330715
                        Uniquely mapped reads % |       94.11%
                          Average mapped length |       291.75
                       Number of splices: Total |       18632044
            Number of splices: Annotated (sjdb) |       0
                       Number of splices: GT/AG |       18418926
                       Number of splices: GC/AG |       175340
                       Number of splices: AT/AC |       8071
               Number of splices: Non-canonical |       29707
                      Mismatch rate per base, % |       0.00%
                         Deletion rate per base |       0.00%
                        Deletion average length |       1.51
                        Insertion rate per base |       0.00%
                       Insertion average length |       1.22
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       0
             % of reads mapped to multiple loci |       0.00%
        Number of reads mapped to too many loci |       389809
             % of reads mapped to too many loci |       1.90%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |       0
       % of reads unmapped: too many mismatches |       0.00%
            Number of reads unmapped: too short |       819554
                 % of reads unmapped: too short |       3.99%
                Number of reads unmapped: other |       408
                     % of reads unmapped: other |       0.00%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

Anatomy of a SAM/BAM file

Sequence alignment map (SAM) format official documentation is available here. The SAM file consists of a header and rows for each read. Each row contains 11 mandatory fields. BAM is the binary compressed version of SAM.

For decoding the SAM Flags, check out this website from the Broad Institute.

A00738:657:H72KHDSX7:1:2513:31566:1110  163     chr01   2960    255     150M    =       3065    254     CACCCACAGGCACCACCGTCCTTGTTGGTAATGAAGAAGACGAGACGACGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGG  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NH:i:1  HI:i:1  AS:i:297        nM:i:0
A00738:657:H72KHDSX7:1:1330:26928:13870 99      chr01   2983    255     149M    =       3071    223     GTTGGTAATGAAGAAGACGAGACGACGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF   NH:i:1  HI:i:1  AS:i:282        nM:i:0
A00738:657:H72KHDSX7:1:1159:8594:10551  163     chr01   3001    255     150M    =       3069    218     GAGACGACGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGC  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  NH:i:1  HI:i:1  AS:i:298        nM:i:0
A00738:657:H72KHDSX7:1:1640:6551:15311  163     chr01   3006    255     149M    =       3088    232     GACGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGAC   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NH:i:1  HI:i:1  AS:i:297        nM:i:0
A00738:657:H72KHDSX7:1:2574:25084:30937 99      chr01   3007    255     149M    =       3207    448     ACGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGACA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFF:FF   NH:i:1  HI:i:1  AS:i:297        nM:i:0
A00738:657:H72KHDSX7:1:1327:12717:27023 99      chr01   3008    255     149M    =       3086    227     CGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGACAA   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF   NH:i:1  HI:i:1  AS:i:296        nM:i:0


A00738:657:H72KHDSX7:1:1327:12717:27023
99      
chr01
3008
255
149M
= 
3086
227   
CGACTTCCCCACTAGGAAACACGACGGAGGCGGAGATGATCGACGGCGGAGAGAGCTACAGAAACATCGATGCCTCCTGTCCAATCCCCCCATCCCATTCGGTAGTTGGATTGAAGACTACCGAATAAGAGAAGCAGGCAGGCAGACAA
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF
NH:i:1  HI:i:1  AS:i:296  nM:i:0
Sequence identifier
SAM FLAG
Reference sequence name (chromosome/scaffold)
1-based leftmost mapping position
Mapping quality (MAPQ)
CIGAR string
Reference name of mate
Position of mate
Observed Template length
Sequence
Base quality
Optional fields

Workflow - Convert BAM to BigWig (Optional)

The BAM file produced from STAR contains alignment information that can be viewed in IGV. However, this file can be big and sluggish so one option is to convert the BAM file to a smaller BigWig format that will allow fast visualization of the results.

Note

The BigWig format will aggregate and bin your data across the genome. Therefore, you will lose individual read information. For observing SNPs, you’ll want to use the BAM files instead.


while IFS=, read -r srr_id eco geno trt tiss biorep samplename fq1 fq2
do
  sbatch code/bamTobw.sh $samplename
done < <(tail -n +2 raw/metadata.csv)
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=100G
#SBATCH --time=02:00:00
#SBATCH --job-name="bam2bw"
#SBATCH --output=log/%x_%j.log
#SBATCH -p batch # This is the default partition, you can use any of the following; intel, batch, highmem, gpu

# Shell script to generate BigWig files from STAR bam output
# Since the dataset is non-stranded, we can convert directly from the BAM file

# Load deeptools
module load deeptools

# Set variables
SAMPLENAME=$1
IN_DIR=analysis/star

bamCoverage -b $IN_DIR/${SAMPLENAME}_Aligned.sortedByCoord.out.bam \
    -o $IN_DIR/${SAMPLENAME}.bw \
    -p 4 \
    --normalizeUsing CPM \
    --exactScaling \
    --skipNAs

Day Two

Workflow Outline

  • Create metadata
  • QC using Fastqc and trim_galore
  • Genome indexing for STAR aligner
  • Sequence alignment with STAR
  • Gene quantification using featureCounts
  • Differential expression analysis with DESeq2
  • Data visualization in IGV

Workflow - featureCounts

After aligning the reads to the genome, we must quantify total reads associated with each genome feature (e.g., gene).


sbatch code/featurecounts.sh
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --cpus-per-task=2
#SBATCH --mem=12g
#SBATCH --time=0-4:00:00
#SBATCH --mail-type=ALL
#SBATCH --job-name="featurecount"
#SBATCH --output=log/%x_%j.log

##################################################################
## Shell script to quantify read counts with FeatureCounts

## Load modules
module load subread # v2.0.3

## Setting variables
IN_DIR=analysis/star
OUT_DIR=analysis/featurecounts
STRAND=0 # (0-unstrand, 1-strand, 2-reverse strand)
GTF=genome/Araport11_GFF3_genes_transposons.201606.gtf

# Create analysis folder (if it doesn't exist)
[ ! -d "$OUT_DIR" ] && mkdir -p "$OUT_DIR"

## Set featuercounts params
# -p : indicates paired-end reads
# --countReadPairs : count read pairs instead of reads
# --primary : count primary alignments only

PARAMS="-T 2 \
        -s $STRAND \
        -a $GTF \
        -p \
        -t exon \
        -g gene_id \
        --countReadPairs \
        --primary \
        -o ${OUT_DIR}/featurecounts.txt"

## Running featureCounts
featureCounts ${PARAMS} $IN_DIR/*bam
 #Program:featureCounts v2.0.6; Command:"featureCounts" "-T" "2" "-s" "0" "-a" "genome/Araport11_GFF3_genes_transposons.201606.gtf" "-p" "-t" "exon" "-g" "gene_id" "--countReadPairs" "--primary" "-o" "analysis/featurecounts/featurecounts.txt" "analysis/star/mir163_1_Aligned.sortedByCoord.out.bam" "analysis/star/mir163_2_Aligned.sortedByCoord.out.bam" "analysis/star/mir163_3_Aligned.sortedByCoord.out.bam" "analysis/star/wt_1_Aligned.sortedByCoord.out.bam" "analysis/star/wt_2_Aligned.sortedByCoord.out.bam" "analysis/star/wt_3_Aligned.sortedByCoord.out.bam" 
Geneid  Chr     Start   End     Strand  Length  analysis/star/mir163_1_Aligned.sortedByCoord.out.bam        analysis/star/mir163_2_Aligned.sortedByCoord.out.bam        analysis/star/mir163_3_Aligned.sortedByCoord.out.bam        analysis/star/wt_1_Aligned.sortedByCoord.out.bam    analysis/star/wt_2_Aligned.sortedByCoord.out.bam        analysis/star/wt_3_Aligned.sortedByCoord.out.bam
AT1G01010       Chr1;Chr1;Chr1;Chr1;Chr1;Chr1   3631;3996;4486;4706;5174;5439       3913;4276;4605;5095;5326;5899   +;+;+;+;+;+     1688185     186     248     310     140     258
AT1G01020       Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1     6788;6788;6788;6788;6788;6788;7157;7157;7157;7157;7157;7157;7384;7384;7384;7384;7564;7564;7564;7564;7564;7564;7762;7762;7762;7762;7762;7942;7942;7942;7942;7942;8236;8236;8236;8236;8236;8236;8417;8417;8417;8417;8571;8571;8571;8594;8594;8594 7069;7069;7069;7069;7069;7069;7232;7232;7232;7232;7450;7450;7450;7450;7450;7450;7649;7649;7649;7649;7649;7649;7835;7835;7835;7835;7835;7987;7987;7987;7987;7987;8464;8325;8464;8325;8325;8325;8464;8464;8464;8464;9130;9130;8737;9130;9130;8737     -;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;- 1571    294     265     413     458     293334
AT1G03987       Chr1    11101   11372   +       272     23      15 17       29      16      20
AT1G01030       Chr1;Chr1;Chr1;Chr1;Chr1        11649;11649;12424;13335;13335       12354;13173;13173;13714;13714   -;-;-;-;-       1905188     216     234     262     192     205
AT1G01040       Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1     23121;23416;24542;24542;24752;24752;25041;25041;25524;25524;25825;25825;26081;26081;26292;26292;26543;26543;26862;26862;27099;27099;27372;27372;27618;27618;27803;27803;28708;28708;28890;28890;29160;29160;30147;30147;30410;30410;30902;30902     24451;24451;24655;24655;24962;24962;25435;25435;25743;25743;25997;25997;26203;26203;26452;26452;26776;26776;27012;27012;27281;27281;27536;27533;27713;27713;28431;28431;28805;28805;29080;29080;30065;30065;30311;30311;30816;30816;31120;31227 +;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+;+     6279    1066    1034    14131462    864     1138
AT1G03993       Chr1    23312   24099   -       788     0       0  00       0       0
AT1G01050       Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1;Chr1   31170;31382;31521;31521;31693;31693;31933;31933;32088;32088;32282;32282;32431;32431;32547;32547;32839;33029     31424;31424;31602;31602;31813;31813;31998;31998;32195;32195;32347;32347;32459;32459;32670;32670;33009;33171 -;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;-;- 1165    999     1218    1258    1130    11611269
AT1G03997       Chr1    32727   33009   +       283     0       0  00       0       0

Workflow - Running DESeq2

DESeq2 is an R package that allows differential expression analysis.

sbatch code/DESeq2.sh
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --cpus-per-task=16
#SBATCH --mem=100g
#SBATCH --time=0-8:00:00
#SBATCH --mail-type=ALL
#SBATCH --job-name="deseq2"
#SBATCH --output=log/%x_%j.log
##################################################################

####################
# Executing Rscript
# conda activate DEG-analysis
# Rscript --vanilla DESeq2.R
####################

# Load conda environment
source activate DEG-analysis

# Run R script
Rscript --vanilla code/DESeq2.R
#!/usr/bin/env Rscript

# Load libaries
library(tidyverse)
library(DESeq2)
library(RColorBrewer)
library(gplots)
library(EnhancedVolcano)
library(ggfortify)
library(glue)
library(plotly)

# Load count matrix file from featurecounts and clean up the header
counts <- read_tsv("analysis/featurecounts/featurecounts.txt",
          col_names=TRUE, skip=1)

new_name <- str_extract(names(counts)[endsWith(names(counts), "bam")],
            "(?<=star/).*(?=_Aligned)")

names(counts)[endsWith(names(counts), "bam")] <- new_name

counts <- counts %>%
  select(-Chr, -Start, -End, -Strand, -Length) %>%
  column_to_rownames(var = "Geneid") %>%
  as.matrix()

# Load sample info and metadata

sampleinfo <- read_csv("raw/metadata.csv", col_names=TRUE)

sampleinfo <- sampleinfo %>%
  mutate(genotype = factor(genotype, levels = c("wt", "miR163_mut"))) %>%
  arrange(samplename) %>%
  column_to_rownames(var = "samplename") %>%
  as.matrix()

write.table(as.data.frame(counts),
  file = "analysis/featurecounts/count_matrix.csv",
  sep = ",",
  quote = FALSE,
  row.names = F)

# Load annotation file
annot <- read_csv("genome/ath_gene_annot.csv", col_names=TRUE)

# Construct DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = sampleinfo,
                              design = ~ genotype)

# Reorder factor level
dds$genotype<- relevel(dds$genotype, ref = "wt")

# Pre-filter low counts
keep <- rowSums(counts(dds)) >= 5
dds <- dds[keep,]

dds <- estimateSizeFactors(dds)

normalized.counts <- as.data.frame(counts(dds, normalized=TRUE)) %>%
  as_tibble(rownames = "locus") %>%
  left_join(., annot, c('locus'='tair_locus'))

# write to file
write.table(normalized.counts,"analysis/deseq2/genelists/DESeq2_normalizedcounts.txt", 
            sep = "\t", 
            quote = FALSE,
            row.names = FALSE)

############## Pre-Processing Quality Assessment (START) ##############

## Library sizes bar plot

librarySizes <- colSums(counts)

pdf("analysis/deseq2/plots/lib-size.barplots.pdf", width = 7, height = 7)
barplot(librarySizes,
        names=names(librarySizes),
        las=2,
        main="Barplot of library sizes")
abline(h=20e6, lty=2)
dev.off()

png("analysis/deseq2/plots/lib-size.barplots.png")
barplot(librarySizes,
        names=names(librarySizes),
        las=2,
        main="Barplot of library sizes")
abline(h=20e6, lty=2)
dev.off()

## Count distribution Boxplots

logcounts <- log2(counts + 1) # no normalization
title_boxplot <- paste("Count Distribution Boxplots")

pdf("analysis/deseq2/plots/count-dist.boxplots.raw.pdf", width = 7, height = 7)
boxplot(logcounts,
        main=title_boxplot,
        xlab="",
        ylab="Log2(Counts+1)",
        las=2,
        col=c(rep("red",3), rep("blue",3)))
abline(h=median(as.matrix(logcounts)), col="blue")
dev.off()

png("analysis/deseq2/plots/count-dist.boxplots.raw.png")
boxplot(logcounts,
        main=title_boxplot,
        xlab="",
        ylab="Log2(Counts+1)",
        las=2,
        col=c(rep("red",3),rep("blue",3)))
abline(h=median(as.matrix(logcounts)), col="blue")
dev.off()

rlogcounts <- rlog(counts) #normalization
title_boxplot_rlog <- paste("Count Distribution Boxplots (rlog)")

pdf("analysis/deseq2/plots/count-dist.boxplots.rlog.pdf", width = 7, height = 7)
boxplot(rlogcounts,
        main=title_boxplot_rlog,
        xlab="",
        ylab="rlog(Counts)",
        las=2,
        col=c(rep("red",3),rep("blue",3)))
dev.off()

png("analysis/deseq2/plots/count-dist.boxplots.rlog.png")
boxplot(rlogcounts,
        main=title_boxplot_rlog,
        xlab="",
        ylab="rlog(Counts)",
        las=2,
        col=c(rep("red",3),rep("blue",3)))
dev.off()


## Correlation heatmaps

colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
rld <- rlog(dds, blind=TRUE)
corr_samps <- cor(as.matrix(assay(rld)))

pdf("analysis/deseq2/plots/cor-heatmap.pdf", width = 7, height = 7)
heatmap.2(corr_samps,
          trace = "none",
          col = colors,
          main = "Sample correlations")
dev.off()

png("analysis/deseq2/plots/cor-heatmap.png")
heatmap.2(corr_samps,
          trace = "none",
          col = colors,
          main = "Sample correlations")
dev.off()

## Principal Component Analysis

# autoplot requires the ggfortify library

pcDat <- prcomp(t(rlogcounts))
title_pca <- paste("PC1 vs PC2, All Samples")

pdf("analysis/deseq2/plots/PCA.pdf", width = 7, height = 7)
autoplot(pcDat,
         main=title_pca,
         data = sampleinfo,
         colour="genotype",
         shape=FALSE)
dev.off()

png("analysis/deseq2/plots/PCA.png")
autoplot(pcDat,
         main=title_pca,
         data = sampleinfo,
         colour="genotype",
         shape=FALSE)
dev.off()

## Plotly version of PCA plot
p <- autoplot(pcDat,
         main=title_pca,
         data = sampleinfo,
         colour = "genotype",
         label = F,
         label.repel = F,
         size = 8,
         group = "time",
         shape = "time")

plt <- plotly::ggplotly(p)
htmlwidgets::saveWidget(plt, "analysis/deseq2/plots/PCA.html")

############## Pre-Processing Quality Assessment (END) ##############


############## Differential Expression (START)##############

# Calculate DE for all contrasts
dds <- DESeq(dds)

# Extract different contrast (i.e., comparisons)
contrast_list <- resultsNames(dds)
contrast_list <- contrast_list[contrast_list != "Intercept"] # remove Intercept




# Function to extract info for each contrast

deseq2_deg <- function(contrast_name){
  # extracting DEG results
  res <- results(dds, name=contrast_name)
  res05 <- results(dds, name=contrast_name, lfcThreshold = 1, alpha = 0.05)
  res05$padj <- p.adjust(res05$pvalue, method="BH")
  res05 <- res05[!is.na(res05$padj),]

  res_filt <- res[ which(res$padj < 0.1), ] # total DEGs
  res05_filt <- res05[ which(res05$padj < 0.05), ] # total DEGs
  up <- subset(res_filt, log2FoldChange > 0) # upDEGs (LFC > 0, p<0.1)
  dn <- subset(res_filt, log2FoldChange < 0) # dnDEGs (LFC < 0, p<0.1)
  up_05 <- subset(res05_filt, log2FoldChange > 1) # upDEGs (LFC > 1, p<0.05)
  dn_05 <- subset(res05_filt, log2FoldChange < -1) # dnDEGs (LFC< -1, p<0.05)

  # generates summary file
  summary_file <- c(contrast_name,
                    nrow(res_filt),
                    nrow(up),
                    nrow(dn),
                    nrow(res05_filt),
                    nrow(up_05),
                    nrow(dn_05))

  # export lists
  degs <- subset(res, padj < 0.1) %>%
          as_tibble(rownames = "locus") %>%
          left_join(., annot, c('locus'='tair_locus'))

  degs_05 <- subset(res05, padj < 0.05, log2FoldChange > 1) %>%
             as_tibble(rownames = "locus") %>%
             left_join(., annot, c('locus'='tair_locus'))

  write.table(degs,
              file = paste0("analysis/deseq2/genelists/",contrast_name,"_DEGs.LFC0_p1.txt"),
              quote = FALSE,
              sep= "\t",
              row.names = FALSE)
  write.table(degs_05,
              file = paste0("analysis/deseq2/genelists/",contrast_name,"_DEGs.LFC1_p05.txt"),
              quote = FALSE,
              sep= "\t",
              row.names = FALSE)

  # post-processing plots

  # volcano plots with EnhancedVolcano

  subtitle1 <- paste("LFC > 0, FDR < 0.1")
  subtitle2 <- paste("LFC > 1, FDR < 0.05")

  EnhancedVolcano(res,
      lab = rownames(res),
      x = 'log2FoldChange',
      y = 'pvalue',
      title = contrast_name,
      subtitle = subtitle1,
      pCutoff = 0.1,
      pCutoffCol = 'padj', # p-value
      FCcutoff = 0,
      pointSize = 4.0,
      labSize = 6.0,
      drawConnectors = TRUE,
      widthConnectors = 0.75)

  ggsave(paste0("analysis/deseq2/plots/",contrast_name,"-volcano-LFC0-FDR1.pdf"), width = 7, height = 7)
  ggsave(paste0("analysis/deseq2/plots/",contrast_name,"-volcano-LFC0-FDR1.png"), scale = 2)

  EnhancedVolcano(res05,
      lab = rownames(res05),
      x = 'log2FoldChange',
      y = 'pvalue',
      title = contrast_name,
      subtitle = subtitle2,
      pCutoff = 0.05,
      pCutoffCol = 'padj', # p-value
      FCcutoff = 1,
      pointSize = 4.0,
      labSize = 6.0,
      drawConnectors = TRUE,
      widthConnectors = 0.75)

  ggsave(paste0("analysis/deseq2/plots/",contrast_name,"-volcano-LFC1-FDR05.pdf"), width = 7, height = 7)
  ggsave(paste0("analysis/deseq2/plots/",contrast_name,"-volcano-LFC1-FDR05.png"), scale = 2)

  return(summary_file)
}

# Run function for all contrast
sum_file <- sapply(contrast_list, deseq2_deg)

sum_file_tbl <- sum_file %>%
  t() %>%
  as_tibble() %>%
  filter(V1 != "Intercept") %>%
  select("Comparison"=V1,
         "Total DEGs (p < 0.1)"=V2,
         "Up DEGs (p < 0.1, LFC > 0)"=V3,
         "Down DEGs (p < 0.1, LFC < 0)"=V4,
         "Total DEGs (p < 0.05)"=V5,
         "Up DEGs (p < 0.05, LFC > 1)"=V6,
         "Down DEGs (p < 0.05, LFC < -1)"=V7) %>%
  mutate(across(2:7, as.integer))

write.table(sum_file_tbl,
            file = "analysis/deseq2/genelists/DESeq2.DEG_summary.txt",
            sep = "\t",
            row.names = FALSE,
            quote = FALSE)

Workflow - DESeq2 (pre-processing)

The custom DESeq2 script generates several plots, tables, and genelists organized into the following directories.


analysis/deseq2
├── genelists
└── plots

Lets examine some of the pre-processing information about the dataset.


Workflow - DESeq2 (DEG)

DEGs are defined using two filters:

  • log2foldchange > 0, FDR (adjusted-pvalue) < 0.1
  • log2foldchange > 1, FDR < 0.05


$ls analysis/deseq2/genelists
DESeq2.DEG_summary.txt
DESeq2_normalizedcounts.txt
genotype_miR163_mut_vs_wt_DEGs.LFC0_p1.txt
genotype_miR163_mut_vs_wt_DEGs.LFC1_p05.txt


$ cat analysis/deseq2/genelists/DESeq2.DEG_summary.txt
Comparison      Total DEGs (p < 0.1)    Up DEGs (p < 0.1, LFC > 0)      Down DEGs (p < 0.1, LFC < 0)    Total DEGs (p < 0.05)   Up DEGs (p < 0.05, LFC > 1)     Down DEGs (p < 0.05, LFC < -1)
genotype_miR163_mut_vs_wt       11      6       5    5       3       2

Workflow - MultiQC

MultiQC is used to generate a summary of the raw and processed data. The results are provided in an HTML file.

sbatch code/multiqc.sh
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --cpus-per-task=1
#SBATCH --mem=12g
#SBATCH --time=0-4:00:00
#SBATCH --mail-type=ALL
#SBATCH --job-name="multiqc"
#SBATCH --output=log/%x_%j.log

##################################################################
## Shell script to generate MultiQC summary

## Load environment
source activate DEG-analysis

## Setting variables
OUT_DIR=analysis/multiqc

# Create analysis folder (if it doesn't exist)
[ ! -d "$OUT_DIR" ] && mkdir -p "$OUT_DIR"

# Run multiqc
multiqc --force \
    --dirs \
    --outdir ${OUT_DIR} \
    --filename multiqc \
    analysis

Workflow - Visualization with IGV

To inspect and examine your RNA-seq results, you can load the BigWig or BAM files into IGV.

Show desktop demo